
RETAIL: Broken Digital Journeys Demo
Introduction
This document provides a base demo script for showcasing the capabilities of CX 1-2-3 Broken Digital Journeys through a Jupyter Notebook using teradataml (Teradata Python package) and Plotly. The target audience for either on-premises customers and/or Data Scientists.
The nPath function scans a set of rows, looking for patterns that you specify. For each set of input rows that matches the pattern, nPath produces a single output row. The function provides a flexible pattern-matching capability that lets you specify complex patterns in the input data and define the values that are output for each matched input set.
nPath is useful when your goal is to identify the paths that lead to an outcome. For example, you can use nPath to analyze:
The output from the nPath function can be input to other ML Engine functions or to a visualization tool.
This demo shows how to use Vantage nPath to analyze customer digital behavior, experience, and diagnosis of problems with customer experience and flow. It uses a simulated digital retail data set. Using this data we are conducting path analysis. Specifically, it has examples of looking at the most frequent customer journeys and identifying any common events that could potentially be stopping the customer from making a purchase.
1. Start by importing the required libraries and connecting to the Vantage system.
In the section, we import the required libraries and set environment variables and environment paths (if required).
%%capture
# RESTART KERNEL AFTER THIS. NO NEED TO EXECUTE THIS AFTER RESTART.
!pip install teradataml --upgrade
!pip install colorlover
** Restart Kernel before moving to next cell
Import the libraries mentioned in the below cell.
import json
import getpass
import os
import warnings
import datetime
from collections import defaultdict
import pandas as pd
import numpy as np
import teradataml.dataframe.dataframe as tdf
from teradataml.dataframe.dataframe import in_schema
from teradataml.context.context import create_context, remove_context, get_context
from teradataml.dataframe.copy_to import copy_to_sql
from teradataml.dataframe.fastload import fastload
from teradataml.analytics.sqle.Sessionize import Sessionize
from teradataml.analytics.sqle.NPath import NPath
from teradataml.dataframe.dataframe import DataFrame
from teradatasqlalchemy.types import *
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from collections import defaultdict
import plotly.offline as offline
import colorlover as cl
offline.init_notebook_mode()
warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)
You will be prompted to provide the password. Enter your password, press the Enter key, then use down arrow to go to next cell. Begin running steps with Shift + Enter keys.
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)
eng.execute('''SET query_band='DEMO=RetailBrokenDigitalJourneysPython.ipynb;' UPDATE FOR SESSION; ''')
... performing setup ... setup complete
... Logon successful Connected as: teradatasql://demo_user:xxxxx@host.docker.internal/dbc Engine(teradatasql://demo_user:***@host.docker.internal)
<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7f9a1243d3d0>
2. Getting Data for This Demo
We have provided data for this demo on cloud storage. You have the option of either running the demo using foreign tables to access the data without using any storage on your environment or downloading the data to local storage which may yield somewhat faster execution, but there could be considerations of available storage. There are two statements in the following cell, and one is commented out. You may switch which mode you choose by changing the comment string.
#%run -i ../run_procedure.py "call get_data('DEMO_DigitalEvents_cloud');"
# Takes about 12 seconds
%run -i ../run_procedure.py "call get_data('DEMO_DigitalEvents_local');"
# Takes about 20 seconds
That ran for 0:00:22.98 with 10 statements and 0 errors.
Next is an optional step – if you want to see status of databases/tables created and space used.
%run -i ../run_procedure.py "call space_report();"
You have: #databases=3 #tables=73 #views=41 You have used 105.6 MB of 25,731.3 MB available - 0.4% ... Space Usage OK Database Name #tables #views Avail MB Used MB demo_user 72 33 25,701.3 MB 90.5 MB DEMO_DigitalEvents 0 1 0.0 MB 0.0 MB DEMO_DigitalEvents_db 1 0 30.0 MB 15.0 MB DEMO_Financial 0 7 0.0 MB 0.0 MB
5. Analyze the raw data set
Create a DataFrame to get the data from the table created.
df = tdf.DataFrame(in_schema('DEMO_DigitalEvents', 'Digital_Retail_Events'))
df.head(5)
| Customer_Id | datestamp | event | session_id |
|---|---|---|---|
| 32 | 2021-04-17 18:23:29.000000 | Availability Status | 2 |
| 32 | 2021-04-21 20:49:32.000000 | Home Page | 5 |
| 32 | 2021-04-16 13:29:55.000000 | Product Search | 1 |
| 32 | 2021-04-26 05:18:12.000000 | Cart Abandoned | 8 |
| 32 | 2021-04-24 12:21:08.000000 | Availability Status | 7 |
from teradataml import ConvertTo
converted_data = ConvertTo(data = df,
target_columns = ['Customer_Id','datestamp', 'event','session_id'],
target_datatype = ["integer","timestamp","VARCHAR(charlen=20,charset=LATIN,casespecific=NO)","integer"])
converted_data.result
| Customer_Id | datestamp | event | session_id |
|---|---|---|---|
| 3018 | 2021-05-05 23:14:47.500000 | Product Search | 1 |
| 28304 | 2021-05-14 14:34:10.000000 | Search Not Found | 1 |
| 28304 | 2021-05-13 10:23:11.000000 | Product Search | 0 |
| 19719 | 2021-05-09 23:20:24.000000 | Category Search | 4 |
| 19719 | 2021-05-09 23:19:24.000000 | Home Page | 4 |
| 8585 | 2021-05-09 21:03:22.000000 | Home Page | 3 |
| 8585 | 2021-05-07 00:22:04.500000 | Product Search | 1 |
| 8585 | 2021-05-09 20:56:27.000000 | Home Page | 3 |
| 19719 | 2021-05-07 00:49:47.000000 | Home Page | 2 |
| 28304 | 2021-05-16 20:45:18.500000 | Product Search | 3 |
6. Identifying Broken Digital Journey
This section will walkthrough conducting path analysis using retail data. Specifically, it has examples of looking at the most frequent customer journeys and identifying any common events that could potentially be stopping the customer from making a purchase.
6.1. Sankey Charts
Sankey charts can help visualize pathways and volume of events on the most common paths. Teradata VantageCloud does have built-in visualizations, but if users don't have access to these capabilities, or perhaps there is a preference to work in Python, the Plotly package provides Sankey charting capabilities. Below the Python function reads output from nPath as pandas dataframe and plots Sankey chart. This method will be used throughout for plotting Sankey.
#Convert Teradata nPath output to plotly Sankey
#can handle paths up to 999 links in length
import plotly.graph_objects as go
def sankeyPlot(res, title_text="Basic Sankey Path"):
npath_pandas = res
dataDict = defaultdict(int)
eventDict = defaultdict(int)
maxPath = npath_pandas['count_customer_id'].max()
for index, row in npath_pandas.iterrows():
rowList = row['path'].replace('[','').replace(']','').split(',')
pathCnt = row['count_customer_id']
pathLen = len(rowList)
for i in range(len(rowList)-1):
leftValue = str(100 + i + maxPath - pathLen) + rowList[i].strip()
rightValue = str(100 + i + 1 + maxPath - pathLen) + rowList[i+1].strip()
valuePair = leftValue + '+' + rightValue
dataDict[valuePair] += pathCnt
eventDict[leftValue] += 1
eventDict[rightValue] += 1
eventList = []
for key,val in eventDict.items():
eventList.append(key)
sortedEventList = sorted(eventList)
sankeyLabel = []
for event in sortedEventList:
sankeyLabel.append(event[3:])
sankeySource = []
sankeyTarget = []
sankeyValue = []
for key,val in dataDict.items():
sankeySource.append(sortedEventList.index(key.split('+')[0]))
sankeyTarget.append(sortedEventList.index(key.split('+')[1]))
sankeyValue.append(val)
sankeyColor = []
for i in sankeyLabel:
sankeyColor.append('blue')
sankeyChart = dict(
type='sankey',
node = dict(
pad = 15,
thickness = 20,
line = dict(
color = 'black',
width = 0.5
),
label = sankeyLabel,
color = sankeyColor
),
link = dict(
source = sankeySource,
target = sankeyTarget,
value = sankeyValue
)
)
layout = dict(
title = title_text,
font = dict(
size = 10
)
)
link = dict(source = sankeySource, target = sankeyTarget, value = sankeyValue, color='white')
node = dict(label=sankeyLabel, pad=15, thickness=20, color='orange')
data=go.Sankey(link=link, node=node)
# plot
fig=go.Figure(data)
fig.update_layout(
hovermode ='closest',
title = title_text,
title_font_size=20,
font = dict(size = 10, color = 'white'),
plot_bgcolor='black',
paper_bgcolor="#585958"
)
fig.show()
# fig = dict(data=[sankeyChart], layout=layout)
# iplot(fig, validate=False)
6.2 Calling a basic nPath function.
This function allows for matching of complex patterns in the input data, as well as defining the output values for each matched set of rows.
For the below example:
1. Pass the input data by reference.
2. Provide partitioning (customer_id, session_id) and ordering columns.
3. Mode.**OVERLAPPING** vs. **NONOVERLAPPING**
4. Symbols. Create a set of column expression aliases that can be assembled into a pattern to match.
5. Pattern. Compose a pattern to search for across the rows of events. This pattern is composed of Symbols and directives.
6. Result. Since nPath emits a single row per group-of-row matches, Result indicates what columns make up this row and how to aggregate the data.
6.3 Path to Sales Conversion
Let's start the analysis by taking a look at what a successful path looks like. This is accomplished by running the Vantage nPath analytic function and specifying the pattern as from Any Event to Sales Conversion.
copy_to_sql(converted_data.result, table_name = 'npath_data', schema_name = 'demo_user',
if_exists = 'replace')
#Create two symbols and assemble them with directives:
# 1. EVENT Column match the string 'Sales Conversion' as B and all other EVENT as O
# Pattern directs a range of any row (O) between 1 and 4 times preceding 'Sales Conversion' (B) - O{1,4}.B
npath_sessions = NPath(data1 = DataFrame('"demo_user"."npath_data"'),
data1_partition_column = ['Customer_Id','session_id'],
data1_order_column = 'datestamp',
mode = 'NONOVERLAPPING',
symbols = ['event in (\'Sales Conversion\') as B', 'event not in (\'Sales Conversion\') as O'],
pattern = 'O{1,4}.B',
result = ['FIRST (customer_id OF O) AS customer_id',
'ACCUMULATE (event of any(O,B) ) AS path'])
npath_sessions.result\
.groupby(['path'])\
.count()\
.sort('count_customer_id',ascending=False)\
.to_pandas()\
.head(10)
| path | count_customer_id | |
|---|---|---|
| 0 | [Add To Cart, Sales Conversion] | 35 |
| 1 | [Home Page, Product Search, Product Details, A... | 34 |
| 2 | [Product Search, Home Page, Sales Conversion] | 30 |
| 3 | [Home Page, Availability Status, Sales Convers... | 23 |
| 4 | [Product Search, Sales Conversion] | 20 |
| 5 | [Add To Cart, Product Search, Sales Conversion] | 18 |
| 6 | [Product Search, Add To Cart, Sales Conversion] | 17 |
| 7 | [Home Page, Sales Conversion] | 17 |
| 8 | [Home Page, Product Search, Sales Conversion] | 16 |
| 9 | [Product Search, Product Details, Add To Cart,... | 11 |
Path to Sales Conversion Visualization
In the table view you can see that the most common last step before Sales Conversion is Add to Cart. Visualizing the paths on a Sankey diagram is an effective way to see common digital customer journey patterns.
#warnings.simplefilter(action='ignore', category=DeprecationWarning)
res = npath_sessions.result\
.groupby(['path'])\
.count()\
.sort('count_customer_id',ascending=False)\
.head(50)\
.to_pandas()
sankeyPlot(res,"Path to Sales Conversion")
The TOP 3 events that led to Sales Conversion are Add To Cart ,Product Search and Availability Status.
To check the details of any path or node we can move the mouse pointer over it and check details. For example if you mouse the pointer over the path having the largest width going towards the right most node(Sales Conversion) it shows 55.0, source: Add To Cart, target: Sales Conversion. It means there were 55 events where after adding the product Cart the next step was that the product was bought, which led to Sales.
When the pointer is moved over a Node, for example when the pointer is on the last Node at the right Availability Status it shows incoming flow count: 4 and outgoing flow count: 1 which means that there are 4 different events which lead to Availability Status after which the next event is Sales Conversion.
6.4 Path to Cart Abandoned
The data available tracks customers who abandon their carts, change the Ending Pattern and see if you can learn anything from the common paths to the Cart Abandoned event.
#Create two symbols and assemble them with directives:
# 1. EVENT Column match the string 'Cart Abandoned' as B and all other EVENT as O
# Pattern directs a range of any row (O) between 1 and 4 times preceding 'Cart Abandoned' (B) - O{1,4}.B
npath_sessions = NPath(data1 = DataFrame('"demo_user"."npath_data"'),
data1_partition_column = ['Customer_Id','session_id'],
data1_order_column = 'datestamp',
mode = 'NONOVERLAPPING',
symbols = ['event in (\'Cart Abandoned\') as B', 'event not in (\'Cart Abandoned\') as O'],
pattern = 'O{1,4}.B',
result = ['FIRST (customer_id OF O) AS customer_id',
'ACCUMULATE (event of any(O,B) ) AS path'])
npath_sessions.result\
.groupby(['path'])\
.count()\
.sort('count_customer_id',ascending=False)\
.to_pandas()\
.head(10)
| path | count_customer_id | |
|---|---|---|
| 0 | [Home Page, Product Search, Search Not Found, ... | 538 |
| 1 | [Product Search, Home Page, Search Not Found, ... | 225 |
| 2 | [Product Search, Search Not Found, Cart Abando... | 158 |
| 3 | [Home Page, Search Not Found, Cart Abandoned] | 94 |
| 4 | [Availability Status, Home Page, Product Searc... | 85 |
| 5 | [Availability Status, Product Search, Home Pag... | 69 |
| 6 | [Home Page, Category Search, Product Search, S... | 64 |
| 7 | [Home Page, Product Search, Search Not Found, ... | 62 |
| 8 | [Search Not Found, Home Page, Product Search, ... | 58 |
| 9 | [Home Page, Cart Abandoned] | 50 |
res = npath_sessions.result\
.groupby(['path'])\
.count()\
.sort('count_customer_id',ascending=False)\
.head(50)\
.to_pandas()
sankeyPlot(res,"Path to Cart Abandoned")
The Cart was abondoned by most of the customers of the Search Not Found event occurred.
6.5 Path to Search Not Found
As you can see from the common paths to Cart Abandoned - Search Not Found is experienced by many customers. Explore this further by changing the ending pattern to Search Not Found.
#Create two symbols and assemble them with directives:
# 1. EVENT Column match the string 'Search Not Found' as B and all other EVENT as O
# Pattern directs a range of any row (O) between 1 and 4 times preceding 'Search Not Found' (B) - O{1,4}.B
npath_sessions = NPath(data1 = DataFrame('"demo_user"."npath_data"'),
data1_partition_column = ['Customer_Id','session_id'],
data1_order_column = 'datestamp',
mode = 'NONOVERLAPPING',
symbols = ['event in (\'Search Not Found\') as B', 'event not in (\'Search Not Found\') as O'],
pattern = 'O{1,4}.B',
result = ['FIRST (customer_id OF O) AS customer_id',
'ACCUMULATE (event of any(O,B) ) AS path'])
npath_sessions.result\
.groupby(['path'])\
.count()\
.sort('count_customer_id',ascending=False)\
.to_pandas()\
.head(10)
| path | count_customer_id | |
|---|---|---|
| 0 | [Home Page, Product Search, Search Not Found] | 3175 |
| 1 | [Product Search, Home Page, Search Not Found] | 2017 |
| 2 | [Product Search, Search Not Found] | 973 |
| 3 | [Home Page, Search Not Found] | 934 |
| 4 | [Home Page, Product Search, Home Page, Search ... | 506 |
| 5 | [Home Page, Home Page, Product Search, Search ... | 491 |
| 6 | [Home Page, Product Search, Home Page, Product... | 339 |
| 7 | [Home Page, Product Search, Product Search, Ho... | 234 |
| 8 | [Product Search, Home Page, Home Page, Product... | 225 |
| 9 | [Product Search, Home Page, Product Search, Ho... | 219 |
res = npath_sessions.result\
.groupby(['path'])\
.count()\
.sort('count_customer_id',ascending=False)\
.head(50)\
.to_pandas()
sankeyPlot(res,"Path to Search Not Found")
The Search Not Found event occurred mainly after Product Search from the Home Page.
6.6 Path From Search Not Found
With a significant number of paths leading to Search Not found, as well as the number of customers on these paths, we need to explore more on this.
Now change to swap have 'Search Not Found' as the starting event and see where customers go to after this key event
#Create two symbols and assemble them with directives:
# 1. EVENT Column match the string 'Search Not Found' as A and all other EVENT as O
# Pattern directs a range of any row (O) between 1 and 4 times succeeding 'Search Not Found' (A) - A.O{1,4}
npath_sessions = NPath(data1 = DataFrame('"demo_user"."npath_data"'),
data1_partition_column = ['Customer_Id','session_id'],
data1_order_column = 'datestamp',
mode = 'NONOVERLAPPING',
symbols = ['event in (\'Search Not Found\') as A', 'true as O'],
pattern = 'A.O{1,4}',
result = ['FIRST (customer_id OF O) AS customer_id',
'ACCUMULATE (event of any(A,O) ) AS path'])
npath_sessions.result\
.groupby(['path'])\
.count()\
.sort('count_customer_id',ascending=False)\
.to_pandas()\
.head(10)
| path | count_customer_id | |
|---|---|---|
| 0 | [Search Not Found, Home Page] | 940 |
| 1 | [Search Not Found, Category Search] | 808 |
| 2 | [Search Not Found, Cart Abandoned] | 748 |
| 3 | [Search Not Found, Home Page, Product Search] | 609 |
| 4 | [Search Not Found, Product Search, Home Page] | 507 |
| 5 | [Search Not Found, Product Details] | 248 |
| 6 | [Search Not Found, Product Search] | 226 |
| 7 | [Search Not Found, Home Page, Category Search] | 196 |
| 8 | [Search Not Found, Cart Abandoned, Home Page] | 142 |
| 9 | [Search Not Found, Home Page, Product Search, ... | 119 |
res = npath_sessions.result\
.groupby(['path'])\
.count()\
.sort('count_customer_id',ascending=False)\
.head(50)\
.to_pandas()
sankeyPlot(res,"Path From Search Not Found")
The events that were followed by Search Not Found were mainly Cart Abandoned and Availability Status.
6.7 Storing Customers on a Path
With the above analysis, it can now be confirmed that there is definitely an issue with unsuccessful searches. From here business actions could be taken such as:
The next section will demonstrate how to save a list of customers on any path of interest.
#Create symbols for each event in our desired path and assemble them with directives:
# 1. EVENT Column match the string 'Cart Abandoned' as A
# 2. EVENT Column match the string 'Search Not Found' as B
# 3. EVENT Column match the string 'Product Search' as C
# 4. EVENT Column match the string 'Home Page' as D
# 5. EVENT Column match the string 'Search Not Found' as E
# Pattern directs events in a particular order which in this case is E.D.C.B.A
npath_sessions = NPath(data1 = DataFrame('"demo_user"."npath_data"'),
data1_partition_column = ['Customer_Id','session_id'],
data1_order_column = 'datestamp',
mode = 'NONOVERLAPPING',
symbols = ['event = \'Search Not Found\' as E',
'event = \'Home Page\' as D',
'event = \'Product Search\' as C',
'event = \'Search Not Found\' as B',
'event = \'Cart Abandoned\' as A'],
pattern = 'E.D.C.B.A',
result = ['FIRST (customer_id of ANY (E,D,C,B,A)) AS customer_id',
'ACCUMULATE (event of ANY (E,D,C,B,A)) AS path'])
npath_sessions.result.head()
| customer_id | path |
|---|---|
| 1407 | [Search Not Found, Home Page, Product Search, Search Not Found, Cart Abandoned] |
| 1988 | [Search Not Found, Home Page, Product Search, Search Not Found, Cart Abandoned] |
| 2522 | [Search Not Found, Home Page, Product Search, Search Not Found, Cart Abandoned] |
| 3326 | [Search Not Found, Home Page, Product Search, Search Not Found, Cart Abandoned] |
| 6617 | [Search Not Found, Home Page, Product Search, Search Not Found, Cart Abandoned] |
| 6981 | [Search Not Found, Home Page, Product Search, Search Not Found, Cart Abandoned] |
| 3507 | [Search Not Found, Home Page, Product Search, Search Not Found, Cart Abandoned] |
| 1578 | [Search Not Found, Home Page, Product Search, Search Not Found, Cart Abandoned] |
| 1185 | [Search Not Found, Home Page, Product Search, Search Not Found, Cart Abandoned] |
| 1149 | [Search Not Found, Home Page, Product Search, Search Not Found, Cart Abandoned] |
# Now we will take the list of distinct customers who have taken this path and
# store it in Vantage table for detailed analysis
res = npath_sessions.result\
.groupby(['customer_id','path'])\
.count()\
.sort('count_path',ascending=False)\
.to_pandas()\
.reset_index()
# Writing clean_input dataframe into MEDIQAN_TEMP_INPUT_TABLE table in Vantage
copy_to_sql(df = res[['customer_id','path']],
table_name = 'CustomersOnPath',
#schema_name = 'target_db', # Uncomment & specify if not using the Vantage Live Default Database, which usually is your QLID.
if_exists="replace")
result_df = tdf.DataFrame('CustomersOnPath')
print("Number of records in CustomersOnPath : "+str(result_df.shape[0]))
print("\n")
print("Sample records: \n")
result_df.head(5)
Number of records in CustomersOnPath : 56 Sample records:
| customer_id | path |
|---|---|
| 1407 | [Search Not Found, Home Page, Product Search, Search Not Found, Cart Abandoned] |
| 1988 | [Search Not Found, Home Page, Product Search, Search Not Found, Cart Abandoned] |
| 1578 | [Search Not Found, Home Page, Product Search, Search Not Found, Cart Abandoned] |
| 1185 | [Search Not Found, Home Page, Product Search, Search Not Found, Cart Abandoned] |
| 1149 | [Search Not Found, Home Page, Product Search, Search Not Found, Cart Abandoned] |
The table contains both the customer (Customer_id) and Path (the dominant path selected)
This saved segment can be used as input for further analysis, such as clustering to see if there are any commonalities across the customers, products searched or potentially as input into a customer care process to reach out to the customers to acknowledge their bad experience with our online store.
10. Cleanup
Work Tables
eng.execute('DROP TABLE CustomersOnPath;')
<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7f99dd902940>
eng.execute('DROP TABLE npath_data;')
<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7f99ddb366d0>
Databases and Tables
The following code will clean up tables and databases created above.
%run -i ../run_procedure.py "call remove_data('DEMO_DigitalEvents');"
#Takes 3 seconds
Removed objects related to DEMO_DigitalEvents. That ran for 0:00:02.73
# Removing the connection and clearing session
remove_context()
True
As you can see from this brief demonstration, the Vantage nPath Analytic function is an effective way to conduct path analysis, such as the one we just viewed. Path analysis can span multiple topics and crosses industries.